{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 中核集团也有自己的光伏发电公司"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 将分类预测结果的输出编程Cp*0.03,能够减小预测值错误"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 参赛选手最优MAE0.133，意味着白天平均每点错误13%"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "分类模型的任务是将比较有把握的点预测为负类\n",
    "回归模型的任务是将无异常时间内的MAE降低\n",
    "分类模型和回归模型都需要排除异常数据，让异常数据对模型的影响保留在官方的评价方法里"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 0.从外部传入参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import argparse\n",
    "parser = argparse.ArgumentParser()\n",
    "parser.add_argument('Number',type=int,help='输入电站序号')\n",
    "parser.add_argument('Cp',type=int,help='输入电站额定功率')\n",
    "args = parser.parse_args()\n",
    "num=args.Number\n",
    "cp=args.Cp\n",
    "print('正在处理数据集{}，装机功率为{}'.format(num,cp))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1.文件读取"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import warnings\n",
    "warnings.simplefilter(\"ignore\")\n",
    "os.environ[\"PYTHONWARNINGS\"] = \"ignore\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "file_path=\" \""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data=pd.read_csv(file_path+'train/train_{}.csv'.format(num),parse_dates=['时间'])\n",
    "test_data=pd.read_csv(file_path+'test/test_{}.csv'.format(num),parse_dates=['时间'])\n",
    "weather_data=pd.read_csv(file_path+'气象数据/电站{}_气象.csv'.format(num),parse_dates=['时间'],encoding='gbk')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if num==7:\n",
    "    train_data=train_data[(train_data[\"时间\"] < \"2018/03/01 00:00\") |(train_data[\"时间\"] > \"2018/04/04 23:45\")]\n",
    "    train_data.reset_index(drop=True,inplace=True)\n",
    "if num == 9:\n",
    "    train_data=train_data[(train_data[\"时间\"] < \"2016/01/01 9:00\") | (train_data[\"时间\"] > \"2017/03/21 23:45\")]\n",
    "    train_data.reset_index(drop=True,inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data=pd.concat([train_data,test_data],ignore_index=True)\n",
    "weather_data.drop_duplicates(['时间'],inplace=True,keep='first')\n",
    "data=data.merge(weather_data[['时间','直辐射','总辐射']],left_on=['时间'],right_on=['时间'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2.数据探索"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1 获取测试集的时间范围"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('测试集%s的时间范围:' % num)\n",
    "start=test_data['时间'][0]\n",
    "end=test_data.iloc[-1]['时间']\n",
    "print(start,end)\n",
    "print(80*'-')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.2 获取出现数据缺失的时间段"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "def find_null_time(name,file_num):\n",
    "    print(name+'数据集'+str(file_num))\n",
    "    temp_data=pd.read_csv(file_path+'{}/{}_{}.csv'.format(name,name,file_num),parse_dates=['时间'] )\n",
    "    null_temp_0=temp_data['时间']\n",
    "    null_temp_1=null_temp_0.shift(-1)\n",
    "    null_temp=pd.concat([null_temp_0,null_temp_1],axis=1)\n",
    "    null_temp.dropna(inplace=True)\n",
    "    new_column_name=['原时间','移位时间']\n",
    "    null_temp.columns=new_column_name\n",
    "    null_temp.head()\n",
    "    null_temp['delta_time']=null_temp['移位时间']-null_temp['原时间']\n",
    "    null_temp['time_trans']=(null_temp['delta_time']/(np.timedelta64(1,'m'))).astype(int)\n",
    "    print(null_temp[null_temp['time_trans']!=15])\n",
    "    print(80*'-')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "for name in ['train','test']:\n",
    "    find_null_time(name,num)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3 对电站实际功率为零的特殊样本进行探索"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_error_time=train_data[train_data['实际功率']==0]\n",
    "df_error_time['当日时间']=df_error_time['时间'].dt.time\n",
    "df_error_time['小时']=df_error_time['时间'].apply(lambda x:x.hour)\n",
    "print('实际功率为零的样本数量:',train_data.shape[0],df_error_time.shape[0],round((df_error_time.shape[0])/(train_data.shape[0]),4))\n",
    "df_error_time.to_csv(file_path+'error_time/error_{}.csv'.format(num),encoding='gbk')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.4 查看每日实际功率最大值时间"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import Series\n",
    "train_data['年月日']=train_data['时间'].dt.date\n",
    "train_data['当日时间']=train_data['时间'].dt.time\n",
    "the_max_time=train_data.groupby('年月日').apply(lambda x:x.loc[Series.argmax(x['实际功率'])]['当日时间'])\n",
    "the_max_time.to_csv(file_path+'the_max_time.csv',encoding='gbk')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过对最大值功率在0点分析表明，当天数据量小时应该删除当天数据\n",
    "2017年9月整体发电功率都偏小，怀疑是否有故障的原因"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.5 查看实际功率/实际辐照度的数值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "temp=train_data[(train_data['实际功率']>0)&(train_data['实际辐照度']>0)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "temp['实际比例']=temp['实际功率']/temp['实际辐照度']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "temp.to_csv(file_path+'实际比例_{}.csv'.format(num),encoding='gbk')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.4 通过一些极值来查看规律"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data['辐照度'].argmax()\n",
    "train_data['湿度'].argmin()\n",
    "train_data['实际辐照度'].argmax()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.5 通过日平均值来找寻规律"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data['年月日']=train_data['时间'].dt.date\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "def day_info_plot(info_word,start,end,drop=True):\n",
    "    start_time=pd.to_datetime(start,format='%Y%m%d')\n",
    "    end_time=pd.to_datetime(end,format='%Y%m%d')\n",
    "    sel_data=train_data[(train_data['时间']>=start_time) & (train_data['时间']<=end_time)]\n",
    "    if drop:\n",
    "        sel_data=sel_data[sel_data['实际功率']>0.6]\n",
    "    else: \n",
    "        pass\n",
    "    day_info=sel_data.groupby('年月日')[info_word].agg({'平均值':'mean'})\n",
    "    day_info.reset_index(inplace=True)\n",
    "    plt.figure(figsize=(30,10))\n",
    "    sns.set(context='notebook', style='darkgrid', palette='deep', font='sans-serif')\n",
    "    axes = plt.gca()\n",
    "    axes.set_xlim([start_time,end_time])\n",
    "    sns.scatterplot(x='年月日',y='平均值',data=day_info,legend='full')\n",
    "    plt.savefig(file_path+\"my_plot/train_{}_{}.png\".format(num,info_word))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "for word in ['实际功率','辐照度','实际辐照度','湿度','温度','压强','风向','风速']:\n",
    "    day_info_plot(word,20170101,20170630)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.6 统计每个月早晨/晚上计入功率时间"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datetime as dt\n",
    "temp_data_0=train_data[(train_data['实际功率']>0.6)&(train_data['当日时间']<dt.time(11,59,59))]\n",
    "the_time=temp_data_0.groupby('年月日').apply(lambda x:x.loc[Series.argmin(x['实际功率'])]['当日时间'])\n",
    "df_the_time=the_time.reset_index()\n",
    "df_the_time.to_csv(file_path+'早晨计入时间_{}.csv'.format(num),encoding='gbk')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "temp_data_1=train_data[(train_data['实际功率']>0.6)&(train_data['当日时间']>dt.time(11,59,59))]\n",
    "the_time=temp_data_1.groupby('年月日').apply(lambda x:x.loc[Series.argmin(x['实际功率'])]['当日时间'])\n",
    "df_the_time=the_time.reset_index()\n",
    "df_the_time.to_csv(file_path+'晚上计入时间_{}.csv'.format(num),encoding='gbk')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.7 散辐射值唯一，不加入特征"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weather_data['散辐射'].unique()"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
